The Data

https://www.bls.gov/tus/datafiles_2014.htm

We looked at this last class and got the data into an almost-clean format that we can work with. We quickly rerun the commands that get us back to the point where we left off.

load("atusresp.rda")
load("atussum.rda")
load("atuscps.rda")

social_cols <- names(atussum)
social_cols <- grep("t12", social_cols) # ATUS activity codes starting with 12 are socializing/leisure
social_times <- rowSums(atussum[, social_cols]) # total minutes per respondent

x <- atussum[,c(1, 3:8)] # we like these columns!
x$social_time <- social_times
keep <- c("tucaseid", "num_children", "weekly_earn", "diary_date")
y <- atusresp[, keep]
x <- merge(x, y, by="tucaseid", all.x = TRUE)

keep <- c("tucaseid", "age", "sex", "famincome", "hh_size") # use age to check
y <- atuscps[atuscps$atus == 1,keep]
z <- merge(x, y,by="tucaseid", all.x=TRUE)

We were pondering this problem of strange differences in reported age between ATUS and CPS for the same individuals.

table(z$age.x-z$age.y)
## 
##  -44  -39  -38  -36  -33  -32  -31  -30  -29  -28  -27  -24  -22  -21  -20 
##    1    2    1    1    2    1    2    2    1    1    1    3    1    1    1 
##  -19  -18  -17  -16  -15  -14  -13  -12  -11  -10   -9   -8   -7   -6   -5 
##    1    1    2    5    1    2    1    2    2    7    5    3    3    6    3 
##   -4   -3   -2   -1    0    1    2    3    4    5    6    7    8    9   10 
##    6    6    8   48 9436 1914   21    7    8    5    3    2    3    5    5 
##   11   12   13   14   15   16   18   21   22   24   26   27   28   29   30 
##    4    3    3    2    3    2    3    4    2    1    2    2    3    2    4 
##   31   33   37   39   40   44   48   49   57   64 
##    1    1    1    1    1    1    1    1    1    1

We decided that maybe we should toss out all of the individuals with a discrepancy greater than 1 in absolute value. We did the following:

# z <- z[abs(z$age.x-z$age.y) <= 1,]

We should remove the -1's as well, keeping only differences of 0 or 1:

agediffs <- z$age.x - z$age.y
z <- z[agediffs == 0 | agediffs == 1,] 
# An equivalent alternative -- but don't run both, since agediffs no longer
# lines up with z after the line above. Note %in% also drops rows where
# agediffs is NA, while == keeps them as NA-filled rows:
# z <- z[agediffs %in% 0:1, ]

Check the ‘sex’ variables:

table(z$sex.x, z$sex.y)
##    
##        M    F
##   M 4928    3
##   F    2 6186

The entire set of variables:

names(z)
##  [1] "tucaseid"     "age.x"        "sex.x"        "edu"         
##  [5] "race"         "hispanic"     "metro"        "social_time" 
##  [9] "num_children" "weekly_earn"  "diary_date"   "age.y"       
## [13] "sex.y"        "famincome"    "hh_size"

Tossing out the duplicate variables:

all_cols <- names(z)
z <- z[, !(all_cols %in% c("age.y", "sex.y"))]
names(z)
##  [1] "tucaseid"     "age.x"        "sex.x"        "edu"         
##  [5] "race"         "hispanic"     "metro"        "social_time" 
##  [9] "num_children" "weekly_earn"  "diary_date"   "famincome"   
## [13] "hh_size"

Let’s clean up the ‘.x’ column names:

gsub("\\.x", "", names(z))
##  [1] "tucaseid"     "age"          "sex"          "edu"         
##  [5] "race"         "hispanic"     "metro"        "social_time" 
##  [9] "num_children" "weekly_earn"  "diary_date"   "famincome"   
## [13] "hh_size"
names(z) <- gsub("\\.x", "", names(z))
names(z)
##  [1] "tucaseid"     "age"          "sex"          "edu"         
##  [5] "race"         "hispanic"     "metro"        "social_time" 
##  [9] "num_children" "weekly_earn"  "diary_date"   "famincome"   
## [13] "hh_size"

Save this file:

write.csv(z, "atus_social.csv", row.names = FALSE)

# x <- read.csv("atus_social.csv", as.is = TRUE)
# names(x)

Looking at social_time vs. age

plot(social_time ~ age, data=z)

We can make the points “transparent” to better visualize denser areas:

plot(social_time ~ age, data=z, col=rgb(0,0,0, 0.2), pch=16)

With a simple linear regression model, we might be fitting the following equation:

\[\widehat{\text{social_time}} = \hat\beta_0 + \hat\beta_1 \text{age}\]

In R, the coefficients can be estimated using the lm() function.

lm(social_time ~ age, data=z)
## 
## Call:
## lm(formula = social_time ~ age, data = z)
## 
## Coefficients:
## (Intercept)          age  
##     130.799        3.547

According to R, the equation can be written as:

\[\widehat{\text{social_time}} = 130.80 + 3.55 \text{age}\]
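For example, plugging age = 40 into this fitted equation gives a predicted social time of about 272.7 minutes. A quick check using the coefficients reported by lm() above:

```r
# Predicted social time (minutes) for a 40-year-old,
# using the coefficients estimated by lm() above
b0 <- 130.799
b1 <- 3.547
b0 + b1 * 40   # about 272.7
```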

plot(social_time ~ age, data=z, col=rgb(0,0,0, 0.2), pch=16)
abline(a=130.80, b=3.55, lwd=2, col="red")

What about some other lines?

plot(social_time ~ age, data=z, col=rgb(0,0,0, 0.2), pch=16)
abline(a=131, b=3.7, lwd=2, col="blue")
abline(a=200, b=0, lwd=2, col="green")

A residual is the observed y minus the predicted y.

\[e_i = y_i - \hat y_i\]

More precisely,

\[e_i = y_i - (\hat \beta_0 + \hat \beta_1 x_i)\]
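As a quick sanity check, we can compute the residuals by hand and compare them to what resid() returns. This is a self-contained sketch on made-up toy data, not the ATUS data:

```r
# Self-contained toy example (not the ATUS data)
set.seed(1)
u <- 1:20
v <- 5 + 2 * u + rnorm(20)
m <- lm(v ~ u)

# Residuals by hand: observed minus predicted
e <- v - (coef(m)[1] + coef(m)[2] * u)

# Matches what R stores in the model object
all.equal(as.numeric(e), as.numeric(resid(m)))  # TRUE
```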

For least squares, we want to minimize the sum of the squared residuals:

\[RSS(\beta_0, \beta_1) = \sum_i (y_i - ( \beta_0 + \beta_1 x_i))^2\]

We choose \(\hat\beta_0\) and \(\hat \beta_1\) to be the values that minimize this quantity.
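A small sketch (again on toy data) that the lm() coefficients really do minimize the RSS: nudging either coefficient away from the fitted values makes the RSS larger.

```r
# Toy data, not the ATUS data
set.seed(42)
u <- runif(50, 0, 10)
v <- 3 + 1.5 * u + rnorm(50)

rss <- function(b0, b1) sum((v - (b0 + b1 * u))^2)

m <- lm(v ~ u)
b <- coef(m)

rss(b[1], b[2])          # the minimum
rss(b[1] + 0.5, b[2])    # larger
rss(b[1], b[2] + 0.1)    # larger
```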

Assumptions for inference using linear regression, the LINE assumptions: Linearity (the mean of \(y\) is a linear function of \(x\)), Independence of the errors, Normality of the errors, and Equal variance of the errors.

To check the normality assumption, we usually look at a histogram of the residuals:

m1 <- lm(social_time ~ age, data=z)
summary(m1)
## 
## Call:
## lm(formula = social_time ~ age, data = z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -432.31 -158.02  -34.61  128.31 1034.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  130.799      5.722   22.86   <2e-16 ***
## age            3.547      0.110   32.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208 on 11117 degrees of freedom
##   (231 observations deleted due to missingness)
## Multiple R-squared:  0.08552,    Adjusted R-squared:  0.08544 
## F-statistic:  1040 on 1 and 11117 DF,  p-value: < 2.2e-16
hist(m1$resid)

For a least-squares fit that includes an intercept, the residuals will sum to exactly 0.
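A quick demonstration on toy data (any least-squares fit with an intercept will do):

```r
# Toy least-squares fit with an intercept
set.seed(7)
u <- rnorm(30)
v <- 1 + 2 * u + rnorm(30)
m <- lm(v ~ u)
sum(resid(m))   # essentially 0, up to floating-point error
```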

Residuals should also appear to have roughly constant variance throughout. Below is a plot of \(e\) against \(\hat y\):

plot(resid(m1) ~ fitted(m1), col=rgb(0,0,0,0.2),
     pch=16)

Not a great model. For example, this plot shows that we would be under-predicting for lower values of \(\hat y\).

Try a squared term for age:

m2 <- lm(social_time ~ age + I(age^2), data=z)
m2
## 
## Call:
## lm(formula = social_time ~ age + I(age^2), data = z)
## 
## Coefficients:
## (Intercept)          age     I(age^2)  
##    380.9465      -8.0842       0.1175
x <- data.frame(age = seq(0, 85, by=1))
head(x)
##   age
## 1   0
## 2   1
## 3   2
## 4   3
## 5   4
## 6   5
yhats <- predict(m2, x)
head(yhats)
##        1        2        3        4        5        6 
## 380.9465 372.9798 365.2479 357.7510 350.4889 343.4618
plot(social_time ~ age, data=z, col=rgb(0,0,0,0.2))
lines(x$age, yhats, lwd=2, col="red")